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Abstract. In this paper we present a specialised algebraic manipulation package devoted to 
Celestial Mechanics. The system, called Piranha, is built on top of a generic and extensible frame- 
work, which allows to treat efficiently and in a unified way the algebraic structures most commonly 
encountered in Celestial Mechanics (such as multivariate polynomials and Poisson series). In this 
contribution we explain the architecture of the software, with special focus on the implementation 
of series arithmetics, show its current capabilities, and present benchmarks indicating that Piranha 
is competitive, performance- wise, with other specialised manipulators. 



1. Introduction. Since the late Fifties ([27]) researchers in the field of Celestial 
Mechanics have manifested a steady and constant interest in software systems able to 
manipulate the long algebraic expressions arising in the application of perturbative 
methods. Despite the widespread availability of commercial general-purpose algebraic 
manipulators, researchers have often preferred to develop and employ specialised ad- 
hoc programs (we recall here, without claims of completeness, [10], |31l 152"] . |48J . [7J, 
0], US, 03, EE], BS], [HIS] and [23]). The reason for this preference lies mainly in 
the higher performance that can be obtained by these. Specialised manipulators are 
built to deal only with specific algebraic structures, and thus they can adopt fast algo- 
rithm and data structures and avoid the overhead inherent in general-purpose systems 
(which instead are designed to deal with a wide variety of mathematical expressions). 
The performance gap between specialised and general-purpose manipulators is often 
measured in orders of magnitude, especially for the most computationally-intensive 
operations. 

The need for computer-assisted algebraic manipulation in Celestial Mechanics 
typically arises in the context of perturbative methods. In classical perturbative 
approaches, for instance, the equations of motion of celestial bodies are expanded into 
Fourier series having power series of "small" quantities (e.g., orbital eccentricity, orbital 
inclination, ratio of the masses) as coefficients. This algebraic structure, commonly 
referred to as Poisson series, allows to identify the terms relevant to the specific 
physical problem being considered and apply adequate methods of solution of the 
equations of motion (05], Chapters 7 and 8). Likewise, in the computation of normal 
forms in studies of dynamical systems the Hamiltonian is commonly expanded into 
power series, hence leading to multivariate polynomial algebraic expressions (|33 ). 
Another example is the computation of the tide-generating potential ( |19l [50] ). which 
involves the manipulation of Fourier series with numerical coefficients. 

In order to give a simple example, we recall here that classical perturbative meth- 
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ods employ Fourier series expansions like this one ([42], Chapter 2), 
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which represents the expansion of the cosine of the true anomaly / of the orbit of a 
celestial body in terms of cosines of the mean anomaly M up to fourth order in the 
eccentricity e. Such series are manipulated and combined with other series through 
operations such as addition, subtraction, multiplication, differentiation and symbolic 
substitution, leading to a final Fourier series representing the perturbing potential. 

All the mentioned applications require the ability to perform simple operations 
on series whose number of elements can grow remarkably large. In modern studies on 
the long-term stability of the Solar System, for instance, researchers have to deal with 
series of millions of terms ([37 ). The fact that in many cases it is possible to perform 
the most computationally-intensive calculations within the form of a single algebraic 
structure might explain the persisting interest in specific algebraic manipulators. 

To improve performance with respect to general-purpose systems, different spe- 
cialised manipulators adopt similar techniques which are a natural exploitation of 
the characteristics of the problems that they are built to tackle. Exponents in poly- 
nomials, for instance, are usually represented as hardware integers, instead of the 
arbitrary-size integers employed in general-purpose systems. Perturbative methods, 
indeed, assume that the quantities subject to exponentiation are small, so that in 
practice the exponents encountered in such calculations have values below 10 2 (but 
usually they are lower). Similarly, when calculating the harmonic expansion of the 
tide-generating potential it is often sufficient to represent the series' coefficients as 
standard double-precision floating point variables, whereas general-purpose manip- 
ulators must be able to operate with multiprecision floating point arithmetic and 
thus incur in an overhead that is unnecessary from the point of view of the celestial 
mechanician. 

The work presented here has four major objectives: 

1. to define a framework able to handle concisely and in a generalised fashion 
the algebraic structures most commonly encountered in Celestial Mechanics, 

2. to identify and implement efficient algorithms and data structures for the 
manipulation of said algebraic structures, 

3. to provide means to interact effectively and flexibly with the computational 
objects, and 

4. to provide an open system promoting participation and extensibility. 

We feel that many of the previous efforts in the field of algebraic manipulators de- 
voted to Celestial Mechanics have failed in at least one of these objectives. Many 
manipulators, for instance, have been written for very specific tasks, so that it can be 
difficult to extend or reuse them in other contexts. Others employ algorithms and data 
structures whose performance is not optimal. Yet others are fast and full-featured, 
but they are not open. 

While the work presented here is still in progress, preliminary benchmarks seem to 
indicate that our system, called Piranha, is competitive performance-wise with other 
more mature systems. In this contribution we illustrate Piranha's architecture and its 
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most noteworthy implementation details, with special emphasis on series arithmetic. 
We also present a brief overview of the Python interface and expound on benchmarks. 
Finally, we discuss the future direction of the project, including desirable features not 
yet implemented and possible performance improvements. 

2. Algebraic structures in Celestial Mechanics. As hinted in the introduc- 
tion, Celestial Mechanics employs diverse algebraic structures in the form of series. 
Multivariate polynomials appear naturally in perturbation theories in the form 



]rcix e », (2.1) 



where we have adopted the convention 

x 1 = x x 1 1 ... x n _ 1 (2-2) 

with i = (io, i\, . . . , i n —i) an integer multiindex, ei a vector of integer exponents 
indexed over i, Ci a numerical coefficient indexed over i and x = [xq,Xx, . . . , x n —i) a 
vector of symbolic variables. Although in perturbation theories series expansions have 
infinite terms, in practical calculations series are truncated to some finite order, so 
that the components of the i vectors vary on finite ranges. Please note that, strictly 



speaking, formula (2.1 1 represents a superset of multivariate polynomials, since the 



ei's components are allowed to assume negative values. It is hence more correct to 



refer to expressions in the form of (2.1 1 as multivariate Laurent series, as pointed out 
in [51]. Nevertheless, when referring to polynomials from now on we will encompass 
also Laurent series. 

In Celestial Mechanics another commonly-encountered algebraic structure is the 
so-called Poisson series f|17]l 

i j 

which consists of a multivariate Fourier series in the trigonometric variables t with 
multivariate Laurent series as coefficients. We refer to the elements of the integer vec- 
tor i as trigonometric multipliers. In practical applications it is sometimes convenient 
to represent the trigonometric parts of Poisson series through complex exponentials, 
hence defining a structure which we refer to as polar Poisson series: 

^^Cyx^exp^t-i)], (2.4) 

i j 

where i — y/—l. Poisson series arise routinely in the formulation of the gravitational 
disturbing function, where the symbolic variables represent the orbital elements of 
celestial bodies (02], Chapter 6). 

The subset of degenerate Poisson series with purely numerical coefficients is often 
referred to simply as Fourier series: 



COS (t-i). (2.5) 



Fourier series are used to express the periodic solutions of theories of motion of celestial 
bodies. They are often employed in problems involving the harmonic expansion of 
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Figure 2.1. Definition of series in Piranha. Echelon-0 series are series with numerical coef- 
ficients, echelon-n series with n > are series with echelon-(n — 1) series as coefficients. 



the tide-generating potential, where the theories of motion of celestial bodies are 
plugged into the expression of the potential and manipulated in order to be obtain a 
formulation of the potential in Fourier series form. 

Poisson series can be seen as a subset of echeloned Poisson series, which are 
defined by the following formula: 

yyvc ijk x^ 1 — j- cos ( t .i), (2.6) 

where d is a vector of symbolic variables (the so-called divisors), m a vector of integers 
indexed over the multiindices i, j and 1, and <5yi a positive integer value, again indexed 
over i, j and 1. Echeloned Poisson series are employed, for instance, in lunar theo- 
ries, where it is necessary to express symbolically the frequencies of the trigonometric 
variables. Such frequencies appear as divisors when integrating the equations of mo- 
tion with respect to time. While Poisson series manipulators are relatively common, 
echeloned Poisson series manipulators are less widespread (05], (30]). 

There are other less common algebraic structures for which specialised algebaric 
manipulators exist. We recall here the mainpulator described in 03], which targets 
Earth rotation theories and handles a superset of echeloned Poisson series called Ki- 
noshita series. 

In order to accomodate all the aforementioned algebraic structures, Piranha 
adopts a definition of series based on the following concepts: 

Definition 2.1. Terms are pairs constituted by a coefficient and a key. 
Definition 2.2. Two terms are considered equivalent if and only if their keys 

are. 

Definition 2.3. Series are sets of terms. 

Such concepts allow to introduce a recursive definition of series which is exempli- 



fied in Figure 2.1 and which was inspired by the definition of echeloned Poisson series 
in 05J. The lowest level of this scheme is populated by series whose coefficients are 
numerical, such as Fourier series (where we find trigonometric keys) and polynomials 
(where keys are multivariate monomials). This level represents the terminal point of 
the recursion and we refer to it as echelon-0. The next level, echelon-1, features series 
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whose coefficients are echelon-0 series, and it is populated by Poisson series (in both 
their traditional and polar forms). Similarly echelon-2, featuring echeloned Poisson 
series, is characterised by series whose coefficients are echelon-1 series. The following 
generalisation now follows straightforwardly: 

Definition 2.4. Echelon-0 series are series with numerical values as coeffi- 
cients. 

Definition 2.5. Echelon-n series are series with echelon-(n — 1) series as coef- 
ficients. 

At the present time, Piranha implements Fourier series, polynomials and Poisson 
series, with various possible combinations for the representation of numerical coef- 
ficients and keys. In our opinion, one of the key advantages of this framework is 
that it allows to treat different algebraic structures in a unified way and without 
any overhead. While it is possible, for instance, to use a Poisson series manipulator 
to operate on multivariate polynomials and Fourier series, this approach incurs in a 
performance penalty related to the unnecessary management of empty trigonomet- 
ric keys and monomials, respectively. By using the framework described here, it is 
instead possible to build specialised manipulators with little effort. 

3. A generically-programmed architecture. In order to implement in an 
efficient and concise way the recursive scheme described in ^2] we have chosen to 
adopt a programming paradigm which is both generic and object-oriented. The kind 
of genericity we seek is twofold: 

1. we want to be able to implement series types with different representations 
for coefficients and keys. When undertaking the task of the development 
of the disturbing function in planetary problems, for instance, it may be 
desirable to operate with multiprecision rational numbers as coefficients, in 
order to retain maximum precision. Likewise, in certain problems it may be 
enough to operate with polynomial exponents or trigonometric multipliers 
represented as 16 or even 8 bits integers as opposed to word-size integers, so 
that it is possible to adopt a more compact representation for keys in such 
cases; 

2. consequently, we want to be able to write algorithms for the manipulation of 
terms once and for all the different combinations of coefficients and keys, at 
the same time retaining the ability to specialise the algorithms under certain 
circumstances. 

The ultimate goal is to minimise code duplication and keep the source code as compact 
as possible. In a sense, we perceive Piranha as an attempt to implement the algebraic 
manipulation framework envisioned in |26] . 

Our language of choice for the implementation of Piranha's architecture is C++, 
in its flavour that is sometimes referred to as modern C++ (a term first popularised in 
[2]). This programming paradigm uses extensively C++ templates to achieve generic- 
ity at compile-time and to provide metaprogramming capabilities (meaning that the 
program is able to modify itself and automatically generate source code at compile- 
time). Modern C++ can be used to provide container classes where the type of the 
contained items is parametrised, to enable a type of polymorphism which is entirely 
resolved at compile-time (in contrast to the runtime polymorphism of traditional 
C++), and to write algorithms able to operate on generic types. Since template 
metaprogramming operates during the compilation of the program, its overhead at 
runtime is null, and thus it is particularly appealing in those contexts, such as scientific 
computing, where performance is paramount. 

5 



In Piranha, series are represented as template classes with coefficients and keys 
as parametrised types. A base series class provides the bulk of low-level operations 
on series. Base series are not intended to be used directly, instead they serve as 
a foundation for two other series classes, i.e., named series and coefficient series. 
Named series are intended to be employed directly by the user, and they are so called 
because they embed a description of their arguments in a n-tuple of vectors (where n 
is the echelon level plus one). The series types described in ^2] are all named series. 
Coefficient series, as the name suggests, are instead series serving as coefficients in 
other series. Laurent series as coefficients in Poisson series are an example of coefficient 
series. 

All the series classes implemented in Piranha inherit from the base series class 
and either from the named series class or the coefficient series class. A minimal series 
class defined this way is able to read and save series files, to interact with plain old 
C++ data types (i.e., integers and double-precision values), to produce a numerical 
evaluation of itself based on the substitution of arguments with numerical values and 
to perform series addition and subtraction. The capabilities of the series are then aug- 
mented, again through multiple inheritance, by the use of additional classes which we 
refer to as toolbox classes. Toolboxes are used to provide routines for more advanced 
tasks such as series multiplication, exponentiation to real powers, trigonometric func- 
tions, special functions, series expansions relevant to Celestial Mechanics and so on, 
and also to group together methods relevant only to specific series types. Toolboxes 
are also used to provide those capabilities expected from series with complex coeffi- 
cients and which are useless in series with real coefficients, such as the extraction of 
real and imaginary parts and full arithmetical interoperability between complex and 
real series. 

As an example, the Poisson series class at the present times inherits from the 
following classes: base series, named series, series multiplication toolbox, power series 
toolbox, special functions toolbox, Celestial Mechanics toolbox and common Poisson 
series toolbox. 

Toolboxes are generically programmed and may cross-reference each other. The 
special functions toolbox, for instance, requires the presence of a series multiplication 
toolbox (which might be already provided by the framework, but which can also be 
(re)implemented by the user); if the series multiplication toolbox is not available, 
a compile-time error will be produced. To avoid runtime overhead, the system of 
toolboxes extensively employs a design pattern known as curiously-recurring template 
pattern (HI]). This approach avoids runtime virtual-table lookups by achieving a 
form of polymorphism which is resolved entirely at compile-time (and which has been 
sometimes referred to as static polymorphism). 

At the time of this writing, Piranha supports the following coefficients types: 

1. standard double-precision floating-point, 

2. multiprecision integers, 

3. multiprecision rationals, 

and their complex counterparts. The multiprecision coefficients use the GMP bignum 
library (jSJ). The supported key types are: 

1. dynamically-sized trigonometric array (for Fourier and Poisson series), 

2. dynamically-sized multivariate monomial array (for polynomials). 

For both keys the user may choose as integer size either 8 or 16 bits at compile-time. 
Considering that Piranha currently supports three named series types (polynomials 
and Poisson/Fourier series), there are tens of possible manipulators implementable 
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with a one-line type definition. The following statemenlJJ for instance, defines a 
Fourier series class called df s with double-precision floating-point coefficients and 16 
bits trigonometric keys: 

typedef f our ier_ser ies <double_cf , trig_array <16> > dfs; 

In order to evaluate the global impact of the framework, we have run a simple 
SLOC (single line of code) analysis on Piranha and on two freely-available specialised 
manipulators called Gregoire and Colbert (|13j. |12] 1 which target Fourier series (Gre- 
goire) and Poisson series (Colbert) with floating-point coefficients. Gregoire and Col- 
bert, whose feature set is comparable to Piranha's, are written in Fortran and they 
consist of around 4000 SLOCs each. By comparison Piranha's SLOC count is about 
10 000, but, as pointed out earlier, Piranha really implements many manipulators 
with different characteristics and different representations for coefficients and keys. 
Another interesting detail of our simple SLOC analysis is that around 70% of Pi- 
ranha's source code is shared among all the available series types. 

4. Storage of terms. Series appearing in the context of Celestial Mechanics are 
generally sparse. By this we mean that, given finite upper and lower boundaries for 
the multiindices indexing the series described in ^2] most of the possible multiindices 
combinations will be associated to null terms. This is the main reason that led us to 
choose hashing as storage method for series in Piranha. 

We recall here briefly that hashing is an indexing technique based on a hash 
function that maps the items to be stored to integer values (hash values). Such 
values are used as indices in array-like structures called hash tables (sometimes called 
also dictionaries) . The choice of the hash function, the strategy of collision resolution 
(i.e., how to cope with different items hashing to the same value), the resize policy, 
all contribute to the performance of the hash table and they are thoroughly analysed 
in standard computer science textbooks such as [35] and [16]. We just recall that, 
when properly implemented, hash tables feature expected 0(1) complexity for lookup, 
insertion and deletion of elements. 

Given the definition of series provided in |2] it readily follows that, since terms are 
uniquely identified by their keys, the hash function will take the term's key as argu- 
ment and that coefficients have no role in the identifcation of a term. The presently- 
supported key types (i.e., trigonometric keys and multivariate monomials) can be seen 
as sequences of integers, with the flavour of trigonometric keys (i.e., sine or cosine) 
codified as a boolean flag: 



At the present time, Piranha implements the supported key types as dynamically- 
sized dense arrays, meaning that the integer values are stored in contiguous memory 
areas allocated dynamically and that null values are explicitly stored. This type of 
representation is referred to as distributed in the literature, as opposed to the recursive 
representation in which n-variate series are recursively represented as (n — l)-variate 
series with univariate series as coefficients (see [53] for a comparison of polynomial 
representations in computer algebra systems). 

lr The actual type definition in the source code is syntactically a bit more verbose, but we chose 
to omit some details for better readability. 




(4.1) 
(4.2) 
(4.3) 




[1,0,-1, false] , 
[3,4,-5]. 
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Both trigonometric keys and monomials are implemented on top of a base class 
called int_array, which provides common low-level functionality. Since, as mentioned 
earlier, we don't need the full range of word-size integers for the representation of 
keys, int_array employs an integer packing technique to store multiple sub- word-size 
integers into a word-size integer. The user can choose a size for the packed integers 
of either 8 or 16 bits at compile-time. The data layout of a multivariate monomial in 
eight variables on a 64 bits architecture with packed integer size of 16 bits will then 
look like this: 

no n\ n% n% n± n.5 ng 717 

16 bits 16 bits 

s ^ / 

64 bits 64 bits 

By using a C union it is possible to acces the memory space as an array of either 
word-size or packed-size integers. In addition to saving memory, this approach also 
allows to increase the performance of those operations which can work on multiple 
values independently; when comparing two keys, for instance, we can compare directly 
the word-size integers, hence avoiding the overhead of looping on the packed values 
and performing the comparison of multiple packed integers in one pass. Similarly, it is 
possible to compute the hash value of the key by combining (or mixing) the word-size 
integers instead of the packed integers. 

At the present time, Piranha relies on the hashing facilities provided by the Boost 
C++ libraries (|5]). Boost's hash_ combine function is based on a family of hash 
functions described in [46j, initially conceived for string hashing. This class of hash 
functions is refferred to as shift-add-xor because at each step of the iteration for the 
calculation of the hash value, operations of bit shifting, addition and bitwise exclusive 
OR are used. In Piranha the first word-size integer of each key is used as a seed value 
and then mixed with the other word-size integers of the key. For trigonometric keys, 
additional mixing is provided by the flavour of the key (cast as an integer value). The 
hash value is then used to store the term in a Boost hash_set data structure. This 
hashing container is a fairly standard hash set which uses prime numbers as sizes and 
the modulo operation for extracting a useful index from the hash value, and which 
employs separate chaining for collision resolution. 

The single point of entry for terms in the base series class is the insert () method. 
This method is a wrapper around hash_set's own insert () method which performs 
additional checks before actually inserting the term into the container. The most 
important checks are the following (in order of execution): 

1. term is ignorable: a term is always ignorable when either the coefficient or 
the key are mathematically equivalent to zero. Additionally, coefficients and 
keys may implement additional ignorability criterions (double-precision co- 
efficients, for instance, are considered ignorable when their absolute value is 
below a critical numerical threshold). Ignorable terms are simply discarded; 

2. term is in canonical form: the keys of certain series types might have multiple 
valid mathematical representations. The two trigonometric keys 

cos (x-y), 
cos (—x + y) , 

for instance, are mathematically equivalent, yet their straightforward trans- 
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Figure 4.1. Insertion of a term into a Fourier series. The term (1) is first canonicalised (2), 
then its hash value, h, is calculated from the key (3). The position in the hash table is calculated 
through the modulo operation (4), which, in this example, produces index number 6. Since the sixth 
slot of the hash table contains a term with a different key, the term being inserted is then chained 
in a linked list (5). 



lations into an array-like container will be different: 

[1,-1, true] , 
[—1, 1, true] . 

To remove such ambiguity, trigonometric keys, if necessary, are reduced dur- 
ing insertion to a unique canonical representation in which the first integer 
element is always positive (in case of sine trigonometric keys the canonicali- 
sation of the key must be followed by a change in the sign of the coefficient); 
3. term is not unique: before insertion of a non-ignorable term in canonical form 
it is checked whether an equivalent term exists in the series. This operation 
requires a lookup operation in the hash_set. If an equivalent term exists, 
the coefficient of the term being inserted is added to the existing equivalent 
term. Otherwise, the term is inserted as a new element. 
The base series' insert () method is used when loading series from files, when per- 



forming addition/subtraction of series (see S5.1 1 and in general whenever it is neces- 



sary to manipulate directly the terms of the series (such as in the expansions described 
in jSjl. Figure 4.1 visualises the insertion of a Fourier series term into Boost's hash_set 
data structure. 



5. Series arithmetic. Beside being used frequently in actual computations, 
arithmetic on series constitutes also the basis of more advanced manipulations (such as 
the series expansions described in ^6}. An efficient implementation of basic arithmetic 
is then a fundamental component of an efficient specialised manipulator. 

The sets of series types encountered in Celestial Mechanics typically constitute 
Abelian groups under the operations of addition and subtraction whose identity ele- 
ments are series with zero terms. Additionally, such sets are closed under the operation 
of multiplication, but, in general, it will not possible to express inverse series within 
a given set in finite terms. In the context of Celestial Mechanics, an effective way to 
compute real powers (and hence also the inversion) of certain families of series is the 
generalised binomial theorem, as described in ^6] 
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5.1. Addition and subtraction. The adoption of hashing techniques, as de- 
scribed in ^4] allows for a straightforward implementation of the operations of series 
addition and subtraction. To add a series S2 to a series Si, it will be enough to simply 
insert () all terms belonging to S2 into Si. As described earlier, the insertion routine 
will take care of dealing with equivalent terms; in case of a subtraction, the terms' 
signs will be changed before insertion. 

Since hash tables feature expected 0(1) complexity for element insertion, series 
addition and subtraction will feature expected 0(n) complexity, where n is the number 
of terms of the series being added or subtracted. 

5.2. Multiplication. Series multiplication is the most time-consuming among 
the basic operations that can be performed on series, and hence an obvious target for 
heavy optimisation. Generally speaking, to multiply two series Si and S 2 consisting 
of rii an d n-2 terms respectively, it will be necessary to multiply each term of Si by 
each term of S2. It follows then that it will be needed to perform nin% term- by-term 
multiplications and insertions into the resulting series. The complexity of series mul- 
tiplication is hence quadratic, or 0(n 2 ), with respect to term-by-term multiplications 
and insertion operations. 

We recall here that polynomials can be multiplied by using asymptotically-faster 
algorithms, like Karatsuba (,.'! 1| ; and FFT algorithms ([8]). Such algorithms however 
are not usually employed in Celestial Mechanics for the following reasons: 

• they work best on the assumption that the polynomials to be multiplied are 
dense (|2Q]). This is not generally the case in Celestial Mechanics; 

• these algortihms are practically faster than 0(n 2 ) multiplication for polyno- 
mial degrees that are too high to be encountered in typical Celestial Mechanics 
problems. 

We are not currently aware of algebraic manipulators specialised for Celestial Me- 
chanics using fast polynomial multiplication algorithms. 

We can implement multiplication similiarly to addition and subtraction, i.e., by 
performing term-by-term multiplications and accumulating the results into a newly- 
created series using the insert () method. Indeed, that's precisely what happens 
in Piranha for series in nonzero echelon levels, i.e., when dealing with coefficient 
series. In such cases the time needed for term-by-term multiplication dwarfs the time 
needed for term insertion, so it is of little benefit to optimise the performance of term 
accumulation. 

Echelon-0 series, however, feature numerical coefficients, and hence the cost of 
term multiplication might become comparable to or greater than the cost of term 
insertion (especially with double-precision coefficients). In this situation the cost of 
memory allocations and the poor spatial locality exhibited by hash tables employing 
separate chaining can become performance bottlenecks. For this reason in Piranha 
we adopt an intermediate representationFJfor the multiplication of suitable echelon-0 
series based on an procedure that we conceived autonomously, before learning that it 
is essentially an application of a known algorithm named Kronecker substitution ([36 , 

ED- 

5.2.1. Kronecker substitution. Kronecker substitution can be intuitively un- 
derstood in the case of multivariate polynomials with the aid of a simple table laying 



2 Other specialised manipulators employ intermediate representations during series multiplication. 
The specialised manipulator TRIP, for instance, uses burst tries ([25]) for the intermediate storage 
of series ([22]). 
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Table 5.1 

Kronecker substitution for a 3-variate polynomial up to the third power in each variable. 



out the monomials in lexicographic order. The lexicographic representation for three 
variables x, y and z, and up to the third power in each variable is displayed in Ta- 
ble |5.1| The column of codes is obtained by a simple enumeration of the exponents' 
multiindices. It can be noted how in certain cases the addition of multiindices (and 
hence the multiplication of the corresponding monomials) maps to the addition of 
their codified representation. For instance: 



c([0,0 ) 3])+c([0,l ) 0])=3 + 4 = 7 = c([0,l,3]) = c([0 ) 0,3] + [0,l,0]) 



(5.1) 



where we have noted with c — c (M) the function that codifies a multiindex vector M. 



An inspection of Table 5.1 promptly suggests that for an m-variate polynomial up to 
the n-th power, c's effect is equivalent to a scalar product between the multiindices 
vectors and a constant coding vector c defined as 



c= (n + 1) , (n + 1) 



.,71 + 1,1 



(5.2) 



so that eq. (5.1 1 can be generalised as 



c • (Mi + M 2 ) = c ■ Mi + c ■ M 2 



(5.3) 



While this equation is valid in general due to the distributivity of scalar multiplication, 
the codification of a single multiindex will produce a unique code only if the multi- 
index is representable within the representation defined by c. For the codification 
represented in Table |5.1| for instance, we have 



c= (3+1) 2 , 3 + 1,1 = [16,4,1] 



(5.4) 



If we try to code a monomial in which at least one of the exponents is greater than 



3 using eq. (5.4 1, we can see how the mapping between multiindices and codes is not 



univocal any more. For instance the following different multiindices will produce the 
same code: 
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[0,0, 4] -[16, 4,1] =4, (5.5) 
[0,1,0] -[16, 4,1] =4. (5.6) 

Kronecker substitution constitutes an addition preserving homomorphism be- 
tween the space of integer vectors whose elements are bound in a range and a subset 
of integers. It is equivalent to considering the elements of the multiindex as the digits 
of a number in base n + 1, and transforming it into its decimal counterpart. In the 
3-variate example above, for instance, the last row of the table, 333, is decimal 63 in 
base 4. It also provides a method to reduce multivariate polynomials into univariate 
ones, as the codes can be seen as exponents of a univariate polynomial. 

In Piranha we have adopted a generalisation of Kronecker substitution in which 

• we consider variable codification, i.e., each element of the multiindex has its 
own range of variability, 

• we extend the validity of the codification to negative integers and subtraction 
operations. 

The first generalisation allows to compact the range of the codes. If, for instance, the 
exponent of variable x in the example above varies only from to 1 (instead of varying 
from to 3 like for y and z), we can avoid codes that we know in advance will be 
associated to nonexisting monomials (i.e., all those in which x's exponent is either 2 or 
3). The second generalisation, which derives from the fact that under an appropriate 



coding vector it is possible to change M 2 's sign in eq. (5.31 retaining the validity 
of the homomorphism, allows to apply Kronecker substitution to the multiplication 
of Fourier series. Fourier series multiplication, indeed, requires the ability not only 
to add but also to subtract the vectors of trigonometric multipliers according to the 
elementary trigonometric formulas 

AB AB 

A cos a ■ B cos (3 = —— cos [a. — j3) H — — cos (a + (3) , 

AB . AB 
A cos a ■ B sin p — sin [a + p) sin (a — p) , 

AB AB ^ 

A sin a ■ B cos (3 = sin (a — p") H — — sin (a + (3) , 

AB AB 

A sin a ■ Bsin(3 = — ^— cos (a — (3) — cos (a + (3) . 

As far as we have been able to verify, we have not found any indication of previous 
uses of Kronecker substitution to perform multiplication of Fourier series. 

Table pO] shows the generalised Kronecker substitution for a multivariate polyno- 
mial (or Fourier series) in which the exponents (or trigonometric multipliers) vary on 
different ranges. If we define 

e = (e ,ei, . . . ,e m _i) , (5.8) 

^min / max (^0,min / maxi ^l,min / maxi ■ ■ • ) ^m— l,min / max) i (^-^) 
Ck = 1 + e /c,max ^fc,min> (5.10) 

c = (l, c , c ci, c cic 2 , . . . ,U^L Ck) , (5-11) 
X = c • e min , (5.12) 

it is easy to show that the code of the generic multiindex e is obtained by 



c(e) = c ■ e - x- 
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(5.13) 



m — 1 
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Code 


p i 
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c u,mm 
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Cm — 1 , min 
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H - Co, min 
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&m— l,min 




^l,min 




Co, max 




^0,max ^0,min 


Cm — l,min 
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~f~ ^l,min 




Co, min 
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H~ Co, max Co, min 


Cm — 1 , mi n 


. l 




l 


H - Co, min 


2 


H - Co, max Co, min 


Cm — l,min 


. l 


~t~ ^l,min 


2 


H~ Co, min 


3 


H~ CQ,max CQ,min 



Table 5.2 

Generalised Kronecker substitution for an m-variate polynomial (or Fourier series) in which 
each exponent (or trigonometric multiplier) varies on a different range. 



Codifying monomials and trigonometric keys into integers using Kronecker sub- 
stitution yields some advantages. First, just like in traditional integer packing tech- 
niques, we can save space by using a single integer to represent many integers. Second, 
we can map both multiindex addition and subtraction to the corresponding operation 
on a single integer. Finally, in the context of hashing techniques, the codes obtained 
through Kronecker substitution constitute perfect hash values, in the sense that the 
mapping between the multiindices and the corresponding codes is univocal. Then, 
depending on the range of the codes and on the characteristics of the input series, we 
can adopt two strategies to perform term accumulation during series multiplication: 

1. perfect hashing: if there is enough memory available and the input series are 
not much too sparse, then it is convenient to use the codes of the resulting 
terms directly as indices in an array (i.e., in a perfect hash table). In this way 
to accumulate series terms during multiplication it will be enough to multiply 
term- by-term the coefficients, add/subtract the codes and write directly in 
the memory address indicated by the newly-generated code; 

2. sparse hashing: if the input series are much too sparse or there is not enough 
memory, then the resulting codes can be reduced through the modulo op- 
eration to a smaller value and used to place the terms resulting from the 
multiplication in a standard (i.e., non-perfect) hash table. 

Both these algorithms are discussed in greater detail in |5.2.2| 

In order for Kronecker substitution to be effective, it is necessary that the codes 
vary within a range representable with hardware integers, otherwise the speed ben- 
efits related to the codification are reduced. On modern 64 bits architectures, this 
restriction is seldom limiting in the context of Celestial Mechanics. The monomials 
of a 6-variate polynomial, for instance, can be codified with Kronecker substitution 
into 64 bits unsigned integers up to exponent 1624 in all variables. In any case, if 
Kronecker substitution is not feasible, Piranha will revert to the algorithm, described 
above, used for nonzero echelon level series. 

The main disadvantage of using Kronecker substitution is the need to decode the 
resulting series. Depending on the density of the input series, the time needed to 
perform this step can become a relevant fraction of the total time needed for series 
multiplication. 
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5.2.2. Cache-friendly perfect and sparse hashing on codified series. 

Both the perfect and sparse hash algorithms mentioned above perform the follow- 
ing steps: 

1. code the input series into vectors of nonzero coefficient-code pairs using Kro- 
necker substitution; 

2. multiply coefficient by coefficient the two series, using the addition (and sub- 
traction, for Fourier series) of codes to establish the placement of each result- 
ing coefficient in an appropriate data structure; 

3. decode the result into the input series type. 

In the case of perfect hashing, the data structure used to store the result of coded series 
multiplication is a simple array of coefficients. The index of each coefficient resulting 
from each term-by-term multiplication will simply be given by its corresponding code, 
resulting from the addition or subtraction of the input codes. In perfect hashing, 
hence, each term-by-term multiplication consists of a coefficient multiplication, one 
or two integer additions and one or two memory redirections. 

We have to resort to sparse hashing when we cannot use an array to store the 
resulting coefficients, because an array would be either too large to fit in the available 
memory or too costly to allocate and setup because of the high sparseness of the input 
series (which usually translates in high sparseness also for the resulting series). Thus, 
for sparse hashing we have implemented a cache-friendly hash table for the storage of 
coefficient-code pairs that minimises memory allocations and tries to maximise spatial 
locality of reference. Although different in implementation, the logic of term lookup 
and insertion is the same as explained in Figure |4~T| Our implementation is essentially 
a hash table employing separate chaining in which the buckets have fixed sizes and 
are laid out contiguously in a single dynamically-allocated memory area. In such a 
structure a rehash is triggered not when the load factor reaches a certain threshold 
(as it usually happens in standard hash table implementations) but when a bucket is 
completely filled up. To reduce the need for rehashing, a number of buckets is used 
as an overflow area to store coefficient-code pairs belonging to filled-up buckets, so 
that the rehash is dealyed until the overflow buckets are exhausted. The overflow 
buckets must be checked at every lookup operation. According to our tests, this hash 
table variant with a bucket depth of 12 is usually filled up to ~ 30% before needing 
a rehash. 

The design of the hash table used for sparse hashing aims at maximising perfor- 
mance by utilising efficiently the cache memory found on modern computer architec- 
tures. The buckets, for instance, consist of arrays instead of linked lists (as commonly 
implemented in separate chaining) in order to improve spatial locality of reference, 
so that the load of the first element of the bucket from the slow RAM into the fast 
cache memory entails also the loading of a number of successive elements. If a linked 
list were used instead, such prefetching of successive elements would not be possible 
since in linked lists the second and successive elements are generally not contiguous 
to the first one; buckets implemented as linked lists hence require a load operation 
from RAM for every element. 

We adopted two other techniques in order to optimise cache memory utilisation 
for both perfect and sparse hashing: 

1. ordering of coded input series: in both algorithms the codes resulting from 
Kronecker substitution are used to calculate a memory address (they are used 
directly as an index in perfect hashing, while in sparse hashing they are used 
as an index after the application of the modulo operation). If we order the 
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coded input series according to the code and we proceed to term-by-term 
multiplications in two nested for cycles, it follows naturally that in each 
innermost for cycle (i.e., while iterating over the terms of the second series 
having fixed a term in the first one) we are going to write into successive 
memory areas^J Doing this both improves prefetching and helps the cache 
subsystem determine a predictable memory access pattern. According to our 
tests, this optimisation allowed for a reduction in the time needed to multiply 
large series up to 40% with respect to the case in which the input series are 
unordered; 

2. cache blocking: another way to optimise cache memory utilisation is to pro- 
mote temporal locality, i.e., making sure that data is used as often as possible 
in order to avoid eviction from cache memory. This effect can be obtained 
through blocking techniques: during series multiplication, instead of fixing a 
term in the first series and iterating over all the terms of the second series we 
iterate just over a portion of the second series, then moving on to the second 
term of the first series and repeating the procedure. In other words, we divide 
the series in blocks and multiply block-by-block. This ensures that we are 
going to read and write the same memory areas very frequently, thus max- 
imising the permanence in cache memory. In our tests this optimisation can 
lead to substantial speed gains: in the case of large polynomials the reduction 
of the time needed to perform multiplications can reach 70%. 
The version of Piranha benchmarked in ^8] employs the optimisation techniques de- 
scribed in this section. 

6. Nontrivial operations on series. In Celestial Mechanics it is common to 
encounter algebraic expressions that cannot be mapped directly to Poisson series or 
polynomials. An expression often arising is, for instance, the square root 



\fl -e 2 , (6.1) 

where e is an orbital eccentricity. In other cases it may be necessary to compute the 
sine or cosine of a Poisson series. 

Such occurrences are usually dealt with through series expansions. Expression 



(|6.1| may be rewritten in terms of the MacLaurin development 

e 2 " (6.2) 



(2n)! 



(1 ~2n)nPA r 

for |e| < 1. Such an expansion truncated to a finite order is often appropriate since in 
most practical problems of perturbation theory the value of e is small. In Piranha we 
provide two general series expansions that can be useful to reduce to a representable 
algebraic form many common occurrencies of problematic expressions. 

The first series expansion is employed to compute the real power of a series and 
uses the generalised binomial theorem ([3]): 

oo 



fe=0 



(a: + i/) r = 2JU)» r "V, (6-3) 



3 While this is always true for polynomials, in the case of Fourier series there is also a subtraction 
involved. Thus for Fourier series this optimisation may be not as effective as for polynomials. 
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where, for our purposes, r € Sf and x and y are complex numbers. In Piranha the 
real power of a series S is rewritten in terms of a leading term L and a tail series T, 
so that 



(S) r = (L + T) r (6.4) 
\L r ~ k T k . (6.5) 



OO 

' r 



E 

fe=0 



It is the responsibility of the user to make sure that the series is convergent (i.e., r 
is a positive integer or \L\ > \T\). Piranha will choose the leading term adopting 
series-specific criterions: for Poisson series and polynomials, for instance, the leading 
term is the one with lowest total degree, while for Fourier series the leading term is 
the one with the highest coefficient in absolute value. It is worth noting that while 
T is subject only to trivially-implementable natural powers, L must be in general 
amenable to real exponentiation. In case of Poisson series, the real power of the 
leading term L involves the calculation of the real power of its polynomial coefficient, 
so that the binomial expansion may be called again to compute L r ~ k in a somewhat 
recursive fashion. This method of calculation of real powers can be succesfully applied 



to expressions such as (6.1 1 (effectively producing the same expression as (6.2 1) or to 
compute the inverse of a suitable series. 

The other series expansion implemented in Piranha is used to compute the com- 
plex exponential of real Poisson and Fourier series, and it is based upon the Jacobi- 
Anger expansion ([55]): 

oo 

exp [ix cos 8] = i n J n [x) exp [in0] , (6.6) 



where J n (x) is the Bessel function of the first kind of integer order n, which can be 
expressed by the MacLaurin development 



Jn{x) = ^ 



/=() 



(-1)' 

2 2i + n i\( n + iy: 



(6.7) 



By using eq. (6.6 1 and eq. (6.7), and by remembering the property of Bessel functions 

J- n (x) = (-l) n J n (x), (6.8) 

it is possible to transform the complex exponential of a Poisson series into a product 
of Poisson series. From the complex exponential, the cosine and sine of the Poisson 
series can be extracted as the real and imaginary parts respectively. As far as we were 
able to verify in the literature, no other manipulator implements circular functions 
of Poisson series using the Jacobi-Anger expansion. Usually such calculations are 
performed by decomposing the Poisson series into a leading term and a tail (similarly 
to what is done in the binomial expansion) and by expanding the function through 
elementary trigonometric formulas: 



cos S = cos (L + T) 

= cos L cos T — sin L sin T. 
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(6.9) 
(6.10) 



At this point cosT and sinT are expressed through truncated MacLaurin series. An- 
other approach, recently proposed in [38] and based on a Richardson-type elimination 
scheme, has been shown to be advantageous in certain cases with respect to Taylor 
expansions both in terms of accuracy and computing time. 

With the ability to compute real exponentiation and circular functions on series, 
it is possible to implement an array of capabilities consisting of special functions and 
series expansions relevant to Celestial Mechanics applications (thus implementing the 
basis of what is sometimes called a keplerian processor - see [9j and |TT]). Piranha cur- 
rently implements Bessel functions of the first kind, Legendre polynomials, associated 
Legendre functions, (rotated) spherical harmonics and the elliptic expansions for 
cos 

2, E, /, ^ (/) and exp [mE] (as seen in standard Celestial Mechanics textbooks, 
such as [32], Chapter 2). 

7. Python bindings. Piranha's core is a C++ library that can be used within 
any C++ program. Although we have tried to make the use of the library as friendly 
as possible (e.g., by using operator overloading), usage as a C++ library remains 
somehow uncomfortable: a basic knowledge of the C++ programming language is 
required, and every change of the source code in the main routine must be followed by 
a recompilation of the whole program, a task that can become slow and cumbersome in 
the case of generically-programmed libraries like Piranha. Additionally, direct usage 
as a C++ library is not interactive. 

For these reasons we provide a set of Python bindings (called Pyranha) which is 
intended to be the preferred way of interacting with the manipulator for end-users. In 
Pyranha, Piranha's C++ classes are exposed as Python objects with the help of the 
Boost. Python library. It is important to stress that this is not a translation of C++ 
code into Python: even as Python objects, Piranha's classes are still compiled C++ 
code, so that there is no performance penalty in using the manipulator from Python. 

Using Piranha from an interpreted language like Python has several advantages. 
In our opinion, two important ones are the following: 

1. Python ((45]) is an open-source and widely used language, with a standard- 
ised syntax, an easy learning curve and a vast amount of publicly-available 
additional modules; 

2. Python can be used interactively. Through enhanced Python shells and 
graphics libraries it is possible to provide an interactive environment with 
graphical capabilities. Pyranha takes advantage of the features provided by 
the IPython shell and the matplotlib graphic library (j281 139]). 

Among Python's features, of particular relevance for Pyranha are lambda functions, 
which allow to define small functions inline directly when they are called. Lambda 
functions are particularly handy for two useful series methods available in Pyranha. 
The first one is a plotting method, which works in conjunction with the matplotlib 
module. This method takes as parameter a function that specifies what to plot. The 
following Python code snippet uses a lambda function to specify that it is being 
requested to plot the norms of the coefficients of the series f oo: 

foo.plot(key = lambda t: t . cf .norm) 

This means that from every term t of the series, the coefficient's norm is extracted 
and plotted. Another useful method is the filter () method, which is modelled after 
Python's builtin function with the same name, filter () is used to extract from 
series terms satisfying a criterion specified through a lambda function: 
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f oo . f ilt er ( lambda t: t . key . freq > 0) 

In this case we are extracting all terms of a Poisson/Fourier series whose frequency 
(which is a property of the trigonometric key) is positive, filter () can work in a 
recursive fashion in case of nonzero echelon level series. This means that it is possible 
to pass more than one argument: the first argument is used to filter the terms of the 
series, the second argument to filter the terms of each coefficient series in the terms 
surviving from the first filtering, and so on. filter () can be useful in perturbation 
theories, when it is necessary to select the terms of the series which are relevant to 
the physical problem that is being solved. 

8. Benchmarks. In this section we present a few benchmarks on the multi- 
plication of echelon-0 series. The systems benchmarked are: SDMP, a library for 
sparse polynomial multiplication and division (dQ]), TRIP, an algebraic manipulator 
devoted to Celestial Mechanics ([23]), Pari, a computer algebra system for number 
theory (|H]), Magma ([6]), Singular f|24|) and Maple. The usual caveats about bench- 
marking as a method to measure performance apply. Additionally, we must preface 
the following disclaimers: 

• measurements for other systems were taken by SDMP co-author Roman 
Pearce on a Linux server equipped with a Core2 Xeon 3.0 GHz CPU with 
4MB of L2 cache, while Piranha's measurements were taken on a Linux lap- 
top equipped with a Core2 1.8 GHz CPU with 2MB of L2 cache (the GCC 
compiler version 4.3.2 was used to compile Piranha). We have chosen to ex- 
press the results as both the original timings in seconds and clock cycles per 
term-by-term multiplication (ccpm) in an effort to provide a uniform scale. 
It must be noted however that the latter unit of measure fails to give mean- 
ingful results when the computation time is bounded by memory transfers 
(i.e., when most of the time is spent loading data from RAM). In such cases 
the speed of execution of the same task will depend more on the speed of the 
memory subsystem than on the CPU speed; 

• we have chosen to express Piranha's results as sums of two numbers: the first 
number measures the speed of the actual multiplication, while the second 
number represents the time spent unpacking the result from the coded series 
representation back into the standard series representation (as explained at 



the beginning of [5.2.21. The reason for this distinction will be explained in 
the comments about the individual benchmarks; 
• the computer algebra systems tested adopt different representations for co- 
efficients and exponents (or trigonometric multipliers), so the relevance of 
these benchmarks is related to the adequacy of the representations used in 
the various systems with respect to the task at hand. Additionally, there are 
differences also in the representation of series (certain systems, such as Pari 
and TRIP for instance, use a recursive representation for polynomials, while 
SDMP's and Piranha's representations are distributed). 
We would like to gratefully thank Roman Pearce for allowing us to reproduce the 
benchmarks he performed. The original benchmarks are available online at the fol- 
lowing address: 



http : //www. cecm. sf u. ca/~rpearcea/ 



8.1. Fateman's benchmark. The first benchmark is modelled after that pro- 
posed by Richard Fateman in [20], and consists of the calculation of 



s(s + l), 
18 



(8.1) 



System 



Timings (seconds, ccpm) 



Piranha 2008.11 (double-precision) 

SDMP September 2008 (monomial = 1 word) 

TRIP v0.99 (double-precision) 

Pari 2.3.3 (w/ GMP) 

Magma V2.14-7 

Singular 3-0-4 

Maple 11 



Table 8.1 
Timings for Fateman's benchmark. 



where s is defined as the multivariate polynomial 

s= (l + x + y + z + tf v . (8.2) 

The computation is expressed as s (s + 1) instead of, e.g., s 2 + s, to prevent "smart" 
computer algebra systems to calculate s 2 more efficiently than through a multiplica- 
tion (e.g., by multinomial expansion), s and s + 1 consist of 46 376 terms, while the 
resulting polynomial consists of 635 376 terms. The results are displayed in Table [O] 
Some necessary remarks: 

• the densities of the input polynomials are high enough to make the perfect 
hashing algorithm viable; 

• double-precision coefficients, used by Piranha and TRIP, do not allow to 
represent exactly the coefficients of the resulting polynomial; 

• SDMP uses inline assembly for the multiplication of 61 bits input integer 
coefficients into 128 bits integer coefficients. On Core2 CPUs, this is expected 
to cost around 4 times the cost of double-precision multiplication; 

• other systems use arbitrary-precision coefficients. 

In the case of manipulators devoted to Celestial Mechanics, this test represents more 
of a limit case than a computation likely to be encountered in actual applications. 

In the case of Piranha's timing we remark that, since double-precision multipli- 
cation on Core2 CPUs costs around 4 clock cycles when working in cache, the use of 
the perfect hashing algorithm allows to approach the theoretical speed limit for this 
kind of computation performed with classical algorithms. When disabling the cache 
optimisations described in |5.2.2| the running time for this benchmark is around 36 
seconds. We also note how the time needed to unpack the resulting coded polyno- 
mial (0.6 s) is a small fraction of the time needed for the multiplication of the input 
polynomials. Finally, for comparison we note that if Piranha is benchmarked using 
multiprecision GMP integer coefficients, the running time for this benchmark becomes 
90 s (or 75 ccpm). 

8.2. Sparse polynomial multiplication benchmark. This benchmark was 
proposed originally by Michael Monagan and Roman Pearce, and consists of the 
following polynomial multiplication: 

(1 + x + y + 2z 2 + 3t 3 + 5w 5 ) 12 • (l + u + t + 2z 2 + 3y 3 + 5x 5 ) 12 . (8.3) 

Since the polynomials being multiplied consist of 6 188 terms each and the result 
consists of 5 821335 terms, this multiplication is much more sparse than the one 
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5.4 + 0.6, 4.5 + 0.5 
47, 65 
44, 61 
512, 714 
679, 947 
1482, 2067 
15986, 22298 



System 



Timings (seconds, ccpm) 



Piranha 2008.11 (double-precision) 

SDMP September 2008 (monomial = 1 word) 

TRIP v0.99 (double-precision) 

Pari 2.3.3 (w/ GMP) 

Magma V2.14-7 

Singular 3-0-4 

Maple 11 



3 + 6.8, 141 + 319 



333, 26089 



1.5, 121 
1.9, 148 
54, 4231 
24, 1880 
59, 4622 



Table 8.2 



Timings for sparse polynomial multiplication. 



featured in the previous benchmark. For Piranha this means that the perfect hashing 
algorithm is not viable and that sparse hashing must be employed instead. The results 
are displayed in Table |8.2| 




The most interesting remark for Piranha is the time spent unpacking the coded 
series back into the standard series representation, which amounts to more than twice 
the time spent for the actual multiplication. This behaviour is the combined effect of 
the following causes: 

• we are using a hash table with standard semantics, which means that each 
insertion function will result in a lookup of the term. This lookup is not neces- 
sary because we know that by construction the resulting coded series features 
unique terms. By implementing a custom hash table with an alternative 
unchecked insertion function we could avoid pointless term comparisons; 

• to decode the coded series we need to apply repeatedly costly integer divisions 
and modulo operations, which are particularly cumbersome in this benchmark 
given the high ratio between the number of term-by-term multiplications and 
the size of the resulting polynomial. For this specific benchmark we could 
avoid codification altogether since there are no negative exponents and the 
exponents' values are bounded in the [0, 120] interval. It follows then that in 
this case we could simply pack the exponents as, e.g., 8 bits integers into a 64 
bits word and obtain the same efficiency as Kronecker substitution, without 
the need to apply division and modulo operations for the decodification. This 
optimisation will be implemented in a future version of Piranha; 

• the hash_set data structure needs to perform many memory allocations. 
While this issue can be mitigated by using a pool memory allocator, this 
solution does not avoid frequent cache-unfriendly memory accesses. 

As a test, we have tried to see what happens if, instead of unpacking the series into a 
hash_set, we simply copy the codes into a plain array. This approach is similar to that 
adopted by some manipulators which employ plain array-like storage methodologies. 
The result is that the total running time is reduced to 3.2 s, or 150 ccpm. Although 
this approach would allow to consistently increase performance, for the time being we 
are oriented towards maintaining the current implementation. In addition to the fact 
that it is possible to improve performance with the techniques described above (hash 
table implementation with additional insertion semantics, avoiding needless encoding, 
pool memory allocator), the current programming model is in our opinion convenient 
and clear and allows for capabilities, such as comparison of series in linear time and 
identification of terms in constant time, which have proven to be useful in practice. 
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System 



Timings (seconds) 



Piranha 2008.11 (double-precision) 0.495 
TRIP v0.99 (double-precision) 1.352 



Table 8.3 

Timings for Fourier series multiplication benchmark. 



8.3. Fourier series multiplication. This benchmarks consists of the squaring 
of the ELP3 Fourier series representing the solution of the main problem of the lunar 
theory for the Moon's distance in the ELP2000 theory Q14J). The ELP3 series consists 
of 702 terms and the squaring produces a series with 11 673 term^] The squaring is 
performed 20 times in order to minimise the effect of the time needed to load the series 
from file. The sparseness of the ELP3 Fourier series, despite higher than in Fateman's 
benchmark, is still low enough to make the perfect hashing algorithm viable. 

We report in Table [873| the timings for Piranha and TRIP. We report just the plain 
timings because TRIP's results, communicated to us by TRIP's co-author Mickael 
Gastineau, were taken on a CPU which is slower than the Core2 CPU used for Piranha 
and whose characteristics we are not aware of. According to our tests, the Core2 CPU 
should be ~ 1.5 times faster. 

In Piranha's case the timings translate in roughly 90 clock cycles per term-by-term 
multiplication. It is then evident that here the overhead of decoding the coded series is 
dominating, as the multiplication of Fourier series terms consists of a double-precision 
multiplic ation (~ 4 clock cycles) and a double-precision division (> 6 clock cycles), as 



per eqs. (|5.7|. To confirm this interpretation we have computed first ELP3 2 • ELP3 2 



and then ELP3 4 • ELP3 4 (where ELP3 4 consists of 192 000 terms). For these other 
two benchmarks we measured a total cost of 15 ccpm and 11 ccpm respectively, thus 
confirming that the cache-optimised perfect hashing algorithm allows to reach a high 
throughput, especially for long series. 

9. Conclusions, future work and availability. In this contribution we have 
provided an overview of the architecture upon which the Piranha algebraic manipu- 
lator is built. We have described the high-level representation of series, the storage 
method for terms, based on hashing techniques, and the algorithms used during series 
arithmetic. We have shown how Kronecker substitution can be applied also to the 
multiplication of Fourier series and how it allows to reach promising performance for 
series multiplication. We also have briefly described how using Piranha from Python 
can open up interesting possibilities of interaction with series objects, and how it 
is possible to develop nontrivial functions of series in practical Celestial Mechanics 
problems. 

Future work on Piranha will target the following areas of interest: 

• implementation of echeloned Poisson series; 

• implementation of multiprecision floating-point coefficients (possibly through 
the MPFR library, see pi]); 

• implementation of interval arithmetic for both double-precision and multi- 
precision coefficients; 



4 The exact number of the resulting terms may be subject to variations in the order of few unities, 
depending on the value of the term ignorability threshold. 
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enhancements for the keplerian processor and implementation of standard 
algorithms of perturbation theories in the context of Celestial Mechanics, 
performance improvements for the multiplication of very sparse series (as 



explained in {8.2 I; 

• parallelisation for the most time-consuming operations; 

• enhancements for the Python bindings, in order to improve interactivity, ease 
of use and flexibility (also by means of graphical user interface elements). 

We are also evaluating the possibility to provide some form of interaction between Pi- 
ranha and general-purpose computer algebra systems. We are particularly interested 
in the possibility to interact with the open-source mathematics software Sage f|52]). 

Piranha is freely available under the terms of the GNU public license, and it can 
be downloaded from the website 

http : //piranha . tuxf amily . org 

The manipulator runs on Unix and Windows platforms, and it can be compiled 
with a reasonably recent version of a C++ compiler (the GCC, Intel and Visual 
Studio compilers have been successfully used so far). We hope to be able to gather 
enough interest around Piranha to establish a community of users and developers in 
the future. 
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